Solving the Parquet Equations for the Hubbard Model beyond Weak Coupling 
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We find that imposing the crossing symmetry in the iteration process considerably extends the 
range of convergence for solutions of the parquet equations for the Hubbard model. When the 
crossing symmetry is not imposed, the convergence of both simple iteration and more complicated 
continuous loading (homotopy) methods are limited to high temperatures and weak interactions. 
We modify the algorithm to impose the crossing symmetry without increasing the computational 
complexity. We also imposed time reversal and a subset of the point group symmetries, but they 
did not further improve the convergence. We elaborate the details of the latency hiding scheme 
which can significantly improve the performance in the computational implementation. With these 
modifications, stable solutions for the parquet equations can be obtained by iteration more quickly 
even for values of the interaction that are a significant fraction of the bandwidth and for temperatures 
that are much smaller than the bandwidth. This may represent a crucial step towards the solution 
of two-particle field theories for correlated electron models. 

PACS numbers: 71.10.Fd,71.10.-w,71.27.+a,71.30.+h 



I. INTRODUCTION 

A natural step to extend most of the existing many- 
body single-particle self-consistent methods is to include 
the full momentum and energy dependence of the ver- 
tex corrections. Historically, the self-consistent approach 
for vertex corrections was first considered by Landau, 
Abrikosov and Khalatnikov in the context of the high 
energy behavior of quantum electrodynamics i The origi- 
nal goal was to develop a non-perturbativc method which 
encodes the information in terms of a system of closed 
integral equations. The parquet equations, in principle, 
provide a framework for self-consistent determination of 
the self-energy and the vertex corrections. They were 
proposed for both boson-boson scattering and fermion- 
fermion scattering during the 1950's^ Methods similar 
to the parquet equations were first introduced in the con- 
text of many-body theory by de Dominies and Martin.^ 
One of the early practical applications was on the x-ray 
absorption and emission problem by Roulet, Gavoret, 
and Nozieres£ Since then, various problems have been 
studied by the parquet summation approach, most no- 
tably, the Fermi liquid in a strong magnetic fieldj^r— the 
disordered electron gas in a strong transverse magnetic 
fields the Anderson impurity model^ - — random po- 
tential problems? 1 ^— the Hubbard modelji2r— Helium- 
4 ) 24 i 25 Helium-3; 2 ^ local moment formation ) 27 ! 28 the vor- 
tex liquid modelj 2 ^— the matrix models ; 33 ' 34 and nu- 
clear structure calculations^ While these applications of 
parquet formulation provide a lot of important insights, 
most of the calculations are based on various approxi- 
mated forms of the parquet equations. 

It is obvious that going from a one particle to a two- 
particle self-consistent calculation represents a significant 
increase in the computational effort, as each two-particle 



vertex contains three independent momentum and fre- 
quency indices. From the point of view of practical cal- 
culation, the number of elements for each index is around 
a few thousands. Therefore, the number of elements for 
the vertices are around tens of millions to a few billions. 
Moreover, all the information is encoded in integral equa- 
tions with complicated structure, in which simplification 
does not seem to be immediately possible. Indeed, in the 
past, the most successful application using the full par- 
quet equations was largely limited to the single Ander- 
son impurity model.— With recent advances in compu- 
tational infrastructure where peta-scale performance has 
become available, calculations for lattice models, such as 
the Hubbard model are now feasible. For example, the 
solution of the parquet equations for a 4 x 4 Hubbard 
cluster with on-site coupling U = 2t and temperature 
T = 0.3t was recently obtained^ 

However, limitations on computer performance and 
storage are apparently not the sole obstacles for obtain- 
ing the solution of the parquet equations. Another major 
barrier is the stability of the solvers. The simple itera- 
tion method, which is widely adapted for the dynamical 
mean field method, often fails to provide a stable solu- 
tion for the parquet equations. In most damp- 
ing scheme has to be employed. Even with the damping 
scheme, when the temperature is low or the coupling is 
large, finding a stable solution still seems to be rather 
difficult^ 

Given the large number of variables and the complex- 
ity of the parquet equations, instabilities in their solution 
may not be unexpected. Methods based on the local gra- 
dient are not likely to be suitable as the Hessian cannot 
be readily calculated. Most of the non-linear solvers only 
have local convergence properties. This may not pose 
a problem if we have a reasonable guess which is close 
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enough to the true solution. Unfortunately, it is not easy 
to obtain a good initial guess for the parquet equations. 
Methods that in principle allow "global" convergence, for 
example, the homotopy method or continuous loading 
method, have been proposed as possible ways to improve 
the calculations.— While these tend to improve conver- 
gence, many steps are required for the solution to move 
along the homotopy path. Moreover, practical experi- 
ence seems to suggest that convergence may still not be 
achieved when the temperature is low or the coupling 
is strong. It is clear that a better solver is definitely re- 
quired for the practical application of the parquet method 
within the context of the strongly correlated systems. 

One of the most prominent differences between the par- 
quet formulation and most of the other approximation 
schemes such as RPA ) 38 i 39 self-consistent spin fluctua- 
tions approach,— and fluctuation exchange approach-^ 
is that the so-called crossing symmetry is obeyed by 
construction of the parquet equations. The crossing 
symmetry 4 ^ implies that a vertex in one channel can also 
produce the vertex in all other channels by pulling or 
crossing the vertex legs and multiplying by appropriate 
constants. It also implies the Pauli exclusion principle 
is automatically satisfied. However, in the course of the 
iteration process, as long as the exact solution of the par- 
quet equations is not obtained, the crossing symmetry is 
violated. The main point in the present paper is to high- 
light that the crossing symmetry is crucial for obtaining 
a stable solution. We devise a modified iteration scheme 
which can obtain a stable solution for the parquet equa- 
tions at lower temperature and stronger coupling than 
that from the previous schemes^ This is achieved pri- 
marily by restoring the crossing symmetry at each step 
of the iteration. 

It is important to notice that because of the large 
number of vertex functions, for production runs, mas- 
sively parallel machines are absolutely necessary. Since 
the vertex functions in different channels are mixed in 
the parquet formulation, an efficient scheme to transform 
the vertex functions storage in different nodes is criti- 
cal to improving the overall efficiency of the calculations. 
Some of the computational details have been explained 
in the previous publication^ We have further improved 
the scheme which allows us to hide the communication 
latency across different nodes behind the local calcula- 
tions within the nodes, which effectively further speeds 
up the calculations. 

The model used for testing the computational scheme 
for solving the parquet equations is the standard Hub- 
bard model at half-filling. The Hamiltonian is 

h = -t ( c L c ^ + H - c -) + Uj2 n ^,h (!) 

<i,j> i 

where U is the on-site repulsion and t = 1 is the hopping 
matrix which establishes a unit of energy. 

The paper is organized as follows. In Section II, we 
reproduce the parquet equation which also serves to fix 
the notation. In Section III, we describe the iteration 



scheme for solving the parquet equations. In Section IV, 
we discuss the violation of crossing symmetry. In Section 
V, we present a modified iteration scheme which explic- 
itly restores the crossing symmetry. We also discuss the 
limitation of the modified scheme and the possible direc- 
tions for further developments. In Section VI, we present 
the leading eigenvalues of the antifcrromagnctic channel 
as a function of the temperature and coupling strength 
and find that the parameter region of stable solutions 
is greatly increased when the crossing symmetry is en- 
forced. A brief summary is contained in Section VII. In 
the Appendix, we present a latency hiding scheme which 
allows substantial increase of the efficiency for solving the 
parquet equations. 



II. PARQUET EQUATIONS 

In order for the current paper to be reasonably self- 
contained, we provide in this section a brief description 
of the parquet equations, largely following Ref. The 
purpose is to highlight the structure of the equations and 
to define the notations that will be used in the subsequent 
sections. We do not intend to provide a derivation of the 
formulation. For this we refer the readers to the literature 
where ample discussions can be found i 18 ' 19 ' 23 ' 36 ' 43 ' 44 

Standard perturbative expansions attempt to describe 
all the scattering processes, at the lowest orders, as 
single- or two-particle Feynman diagrams. In the single- 
particle formulation the self-energy describes the many- 
body processes that renormalize the motion of a parti- 
cle in the interacting background of all the other parti- 
cles. In the two-particle context, one is able to probe the 
interactions between particles using the so-called vertex 
functions, which are matrices describing the two particle 
scattering processes. For example, the reducible (full) 
two-particle vertex F p/t (12;34) describes the amplitude 
of a particle-hole pair scattered from its initial state |3, 4} 
into the final state |1,2). Here, i = 1,2,3,4 represents 
a set of indices which combines the momentum k^, the 
Matsubara frequency iu> ni and, if needed, the spin <7j and 
band index m,. Since the total momentum and energy 
of the vertex are conserved, it is convenient to adapt the 
notation F ph (2 — 3) 1,3 for the numerical implementation 
of the single band Hubbard models 

In general, depending on how particles or holes are 
involved in the scattering processes, one can define three 
different two-particle scattering channels. These are the 
particle-hole (p-h) horizontal channel, the p-h vertical 
channel and the particle-particle (p-p) channel. 

One can further discriminate the vertices according to 
their topology. Starting from the reducible vertex F in- 
troduced above, we may define the irreducible vertex T 
corresponding to the subclass of diagrams in F that can- 
not be separated into two parts by cutting two horizontal 
Green function lines. Similarly, the fully irreducible ver- 
tex A corresponds to the subclass of diagrams in V that 
cannot be split into two parts by cutting two Green func- 
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tion lines in any channel. 

Furthermore, since we are mostly interested in mod- 
els that preserve the SU (2) spin rotation symmetry, and 
since this is an exact symmetry for our two-dimensional 
calculations at non-zero temperature, it is convenient to 
preserve this symmetry. This is accomplished by de- 
composing the vertices in the so-called spin-diagonalizcd 



representation 



18,19 



In this representation, the spin de- 



grees of freedom decompose the particle-hole channel into 
the density and the magnetic channels, and the particle- 
particle channel into the spin singlet and the spin triplet 
channels which we denote as d-channel, m-channel, s- 
channel, and ^-channel respectively. They are defined as 
follows, 
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and similarly for F and A. 

We reproduce the full set of equations for the parquet 
formulation in the spin diagonalized representation in the 
following . 18 i 23 i 36 The Schwinger-Dyson equation is 



UT 2 

S(P) = - — Y, {G(P')G(P' + Q)G(P - Q){F d {Q)p- Q , P , - F m (Q) P . Q , P ,) 
P',Q 

+G{-P')G{P' + Q)G(-P + Q){F S (Q) P „ Q , P , + F t (Q) P _ QiP ,)}, (6) 
I 



where G is the single-particle Green function, which itself 
can be calculated from the self-energy using the Dyson 
equation, 

G-^P) = G^(P) - S(P), (7) 

where Go is the bare Green function. Here, the indices 
P, P' and Q combine momentum k and Matsubara fre- 
quency ioj n , i.e. P = (k, ioJ n )- 

The reducible and the irreducible vertices in a given 
channel are related by the Bethe-Salpeter equation, 

F r {Q)p,P> = Tt(Q)p.p> + M<9)p,p', (8) 

F r , (Q)p.p, = T r , {Q)p :P , + V r , (Q)p P , , (9) 

where r — d or m for the density and magnetic chan- 
nels and r' = s or t for the spin singlet and spin triplet 



channels. The vertex ladders are defined as 

$ r (Q)p, P , = (10) 

J2 F r(Q)P,P"^0 h (Q)p"^r(Q)p",P', 
P" 

*r>(Q)p,P< = (11) 

Y F r' (Q)p,p"X p p (Q)p"r r , {Q)p.., P . , 
p" 

where \o is the product of two single-particle Green func- 
tions. 

The parquet equations in the spin diagonalized repre- 
sentation are 



T d (Q) Pp/ = A d (Q) PP/ - h d (P' - P)p.p+q - l®m(P' - P)p.p+q (12) 

+ i* s (P + P ' + Q)_p_ Qt _p + h t (P + P' + Q)_p_Q,_p, 

r m (Q)pp< = A m (Q) P P> - \^d{P' - P)p,p+q + \^ m { p ' - P)p.p+Q (13) 

- ^ S (P + P' + Q)_P_Q,_P + ^ t (P + P' + Q)-P-Q,-P, 

T s (Q)p P , = A s (Q) PP , + ^ d (P' -P)_p, tP+Q -^ m (P' -P)_ pitP+Q (14) 
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+ \®d(P + P 1 + Q)-P<,-P - \®m{P + P' + Q)-P',-P, 

T t {Q) PP , = A t (Q)pp' + \<i>d(P' -P)-p>.p +Q + l$ m (P' -P)-p',p+q (15) 
- + P' + Q)_p-,_p - \®m{P + P' + Q)-P-,-P- 

I 



It is important to note at this point that if we substi- 
tute the irreducible vertices T fEqs. 1121131141 and [T5)) into 
the Bcthc-Salpeter equation (Eqs. [8] and [9j> the crossing 
symmetry in the full vertex F is automatically satisfied 
regardless of the numerical values of the vertex ladders 



$ and "J, assuming the fully irreducible vertices, A, are 
crossing symmetric. We write all the full vertices explic- 
itly in the following using only the vertex ladders, $, ^ , 
and the fully irreducible vertices, A. 



F d {Q)p, P . = A d (Q) P P> - \<S> d (P' - P)P,P+Q - l^rn(P' - P)P,P+Q (16) 
+ \^s(P + P' + Q)-P-Q,-P + ^%(P + P' + Q)-P-Q,-P + $d(Q)p,P'l 

FUQ)P,P> = Kn{Q) P P> - \<S>d{P' - P)P,P+Q + \*m(P' - P)PP+Q (17) 

- ^ S (P + P' + Q)_ P _ Qi _p + U> t {P + P' + Q)_ P _ Q< _p + $ ro (Q)p, P /; 

Fs(Q)p,P' = K{Q)PP> + \®d(P' ~ P)-P',P+Q-^m(P' - P)-P',P+Q (18) 

+ i$ d (P + p' + Q)-p>,-p - l^m(P + P' + Q)-p',-p + y s (Q)p,p>; 

F t (Q)p,P> = MQ)pp' + \®d(P' - P)-P>,P+Q + \$ m {P' -P)-P',P+Q (19) 
- \®d{P + P' + Q)-P>,-P - \®m{P + P' + Q)-P>,-P + *t(Q)p,p'- 



These relations allow us to restore the crossing symmetry 
for the full vertices without heavy computational over- 
head. 

The prominent technical problem at hand is whether or 
not we can solve this set of equations efficiently without 
resorting to any approximated scheme. An obvious diffi- 
culty is to handle the large number of variables. On go- 
ing from the one-particle level calculation to two-particle 
level calculation, the number of variables which has to be 
monitored grows as the third power of the linear dimen- 
sion of the system. If N t is the number of lattice sites 
times the number of discrete Matsubara frequencies, i.e., 
Nt = Nk x N u , the largest Nt that can be handled is in 
the range 2000 — 3000, i.e., the number of variables can be 



I 

over one billion. One can immediately see that practical 
calculations for reasonably large system sizes pose a seri- 
ous problem, although not insurmountable with modern 
computational facilities where a large number of com- 
puter nodes are accessible. However, in addition to the 
large number of computations associated with solving the 
parquet equations, they also require a complex commu- 
nication pattern between the different processes as we 
discussed in more detail in the Appendix. Moreover, a 
numerical instability is not unexpected, especially when 
the system in the proximity of a phase transition. 
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Forming the ladders 



Dyson-! 




FIG. 1: (Color online) Flow diagram of the algorithm for solv- 
ing the parquet equations. See the text for the description of 
each step. The major computational bottleneck is in the self- 
consistent loop of step 2. The cross channel rotations of the 
vertex ladders from the form required by the Bethe-Salpeter 
equations to that in the parquet equations require expensive 
communications across different nodes in the parallel imple- 
mentation. 



III. NUMERICAL IMPLEMENTATION 

The parquet formulation consists of two sets of equa- 
tions. The first set, made of the parquet equations and 
the Bethe-Salpeter equation, determines the full vertex 
F and the irreducible vertex T given the one-particle 
self-energy £ and the fully irreducible vertex A as the 
inputs. The second set of equations determine the one- 
particle quantities given the full vertex F; it includes the 
Schwinger-Dyson equation and the Dyson equation. 

Since the method is iteration based, the initial guess is 
crucial for obtaining a converged solution. In principle, 
the initial guess can be approximated, for example, by 
second order perturbation. However, in practice, this is 
not the optimal choice, especially when the self-energy 
from the second order perturbation is small. In this case 
the Green function will quickly destabilize the calcula- 
tion. This may relate to the fact that the damping from 
the imaginary part of the self-energy is quickly reduced. 
Since we are supposing that we know the fully irreducible 
vertex A , a practical scheme is to choose the irreducible, 
r and full vertices, F, equal to A, and a large value (a 
few times of the bandwidth) for the imaginary part of 
the self-energy. 

Fig. [T] is an illustration of the flow diagram of the algo- 
rithm, where the fully irreducible vertices are the initial 
input for the calculation. The algorithm can be described 
as the following: 

1. Set the initial conditions for the irreducible and full 
vertices, and the self-energy. 



2. Update the Green's functions and calculate the 
bare susceptibility, Xoi which is given by the product of 
two Green's functions. Solve the parquet and the Bethe- 
Salpeter equations for the irreducible vertices, T. Simple 
iteration is used until the convergence criteria are met 
for the irreducible vertices. 

This completes the update of the vertices. The next 
step is to use the irreducible vertices obtained from the 
parquet equations to construct the full vertices. 

3. Solve the Bethe-Salpeter equation to obtain the full 
vertices, F, using the irreducible vertices from the previ- 
ous step; this is executed exactly by calling the LAPACK 
routines for the inverse of the matrices^ 

With the full vertices obtained, we can update the self- 
energy. 

4. Solve the Dyson- Schwinger equation to obtain the 
self-energy from the full vertices. Simple iteration is used 
until certain convergence criteria are met for the self- 
energy. 

5. Solve the Dyson equation for the fully dressed Green 
function from the self-energy. 

This completes the iteration loop, and the procedure 
is repeated from step 2, until convergence is reached for 
both the self-energy and the irreducible vertices. 

In practice, step 2 which attempts to obtain the irre- 
ducible vertices needs to be iterated for a few times to get 
a reasonable convergence, even in the case where the cou- 
pling is weak and the temperature is high. On the other 
hand, step 4 which attempts to solve the self-energy from 
the updated full vertices is not iterated more than one 
time at each loop, so as to avoid instability (we define 
instability here as the failure to obtain a converged solu- 
tion from the iterative solver). Attempting to solve the 
self-energy at the early stage of the iteration procedure 
where the full vertices are not well converged can gener- 
ally lead to instability. Although in the present paper, 
we only focus on the Hubbard model, instabilities in the 
iteration process have also been observed in solving the 
parquet equations for nuclear structure calculations^ 

A widely used method to avoid the instability in the 
iterative process is to introduce a damping factor in the 
updates of the variables. The updates are modified as 
follows. 

£ = (a)E„ ew + (1 - a)S oW ; (20) 
T = (a)T new + (1 - a)Y old . (21) 

With this damping scheme, the solution for the half-filled 
Hubbard model on a 4 x 4 cluster has been successfully 
obtained for U = 2t and temperature, T = 0.3ii2£. How- 
ever, in the strong coupling regime, obtaining a stable 
solution still seems to be difficult, even though a rather 
heavy damping is employed. 

We will demonstrate the instability problem of the sim- 
ple iterative process by monitoring the leading eigenval- 
ues A defined as 

\ r <t) r = T r (P, P')G{P)G{P + (vr, ir))(j) r (22) 



for r = d and m; similarly 

\ r >4> r , = {-l/2)T r ,(P,P')G{-P)G{P 



(23) 



for r' = s and t. In principle, these leading eigenvalues 
signal a phase transition by going through 1, expressing 
the divergence of the susceptibilities in the corresponding 
channel. 

In Fig. |31 we plot the leading eigenvalues of the den- 
sity, magnetic, spin singlet, and spin triplet channels as 
a function of the number of iteration steps calculated 
with this simple iteration method (SI). The calculation 
is done on a 2 x 2 cluster with 32 frequencies and temper- 
ature T = 0.4i; the damping parameter is a = 0.1. A 
converged solution is obtained for U = 2t; however, for 
U = At and 6t the iterative solutions diverge. Chang- 
ing the damping or the initial self-energy does not help 
in obtaining a converged solution for the larger values of 
U. These are illustrative examples which show the prob- 
lem of using the simple iteration method for solving the 
parquet equations. For weak coupling and not too low 
temperature, converged solution can be obtained. Be- 
yond weak coupling the iteration becomes divergent. 



A. Continuous loading method 
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FIG. 2: (Color online) Diagrammatic representation of the 
six crossing symmetry operations. Note that spin indices are 
hidden for the purpose of clarity. For the first two operations 
(CS 1 & 2), we exchange the lower two external legs. For 
third and fourth operations (CS 3 & 4), we exchange the lower 
two legs and then the right two legs. The first four crossing 
symmetry relationships (CS 1-4) relate the particle-particle 
vertices with the particle-hole vertices. The last two (CS 5 & 
6) are for the particle-hole vertices only, where we exchange 
the lower left and upper right legs. 



A widely used method to alleviate the divergence in 
the non-linear solver is the so-called continuous loading 
or homotopy method. The basis of the continuous load- 
ing method is to construct an auxiliary equation with a 
tuning parameter v 1 so that its solution is trivial for v = 
but the solution of the original equation is recovered for 
v = 1. Symbolically we can write down the set of equa- 
tions of the parquet formulation as f par q ue t (£, T) = 0, 
where fparquet is a large vector. We define the auxiliary 
function as g = vf parquet + (1 — ^)fo, where fo is a func- 
tion with a trivial solution. In our study we choose fo as 
a vector containing all the elements of T — Tq and £ — So, 
where Tq and T,q are the initial guesses for the irreducible 
vertices and the self-energy respectively. The iteration 
method is used to solve the function instead of the 

{parquet- One can readily see that the solution of g(0) is 
trivial while the solution of the f 'parquet is recovered when 
v = 1. The idea is to find a converged solution for g(i/) 
with a small enough v where a converged solution can 
be readily obtained, and then gradually increase v until 
it goes to 1. Therefore, a series of v values are needed, 
which we denote as z^. 

We plot the leading eigenvalues from the continuous 
loading method in Fig. SJ The values of v used are v = 
0.0, 0.5, 0.8, 0.9, 0.95, 0.99, 0.993, 0.996, 0.997, 0.998, 0.999, 
0.9999,1.0. 100 and 200 iterations are performed for 
v < 0.993 and v > 0.996 respectively. For U = 2t, 
converged solution is obtained for v = 1. However, for 
U = At and 6t, the iterative solution diverges. The 
divergences appear before the homotopy parameter is 
pushed to v = 1. These examples illustrate the generic 
behavior when solving the parquet equations by the 



simple iteration method beyond weak coupling. They 
also illustrate that the continuous loading method may 
not be sufficiently robust to solve the problem. The 
damping factor a used in these calculations is 0.1 which 
we believe is a fairly small value, although it may still 
not be sufficient. A rule of thumb for choosing the 
damping parameter is that the damping parameter 
should be close to the value of the inverse of the residual 
between two consecutive iterations; unfortunately, with 
the huge number of variables, this choice will result in a 
very small step and may not be a practical option4£ 

IV. CROSSING SYMMETRY VIOLATION 

The exact solution of the parquet equations automat- 
ically satisfies the crossing symmetry. It is one of the 
most important differences between the parquet formu- 
lation and most other perturbative methods. However, 
within the iteration scheme presented in the last section, 
the crossing symmetry is not fulfilled unless the itera- 
tion converges to an exact solution. From the above sec- 
tion, we clearly find that the iteration method, even with 
the help of the continuous loading scheme, is not robust 
enough to obtain a converged solution beyond weak cou- 
pling. It is desirable to quantify the violation of crossing 
symmetry. The following six equalities are the conse- 
quence of the crossing symmetry (sec Fig. [2] for a di- 
agrammatic representation of these crossing symmetry 
(CS) operations) 

U ee F t {Q) pp , = (24) 



7 



[-(l/2)F m - (l/2)P d ](P + P + Q)_p' i _ P = R u 
L 2 = F S (Q) PP , = (25) 

[-(3/2)F m + (l/2)F d ](P + P + Q)_ P >,_p = R 2 , 

L 3 = F m {Q) pp , = (26) 

[(l/2)F t - (l/2)F t ](P + P + Q)-p-q,-p = R 3 , 
Li = F d (Q) Pp/ = (27) 

[(3/2)F t + (l/2)P t ](P + P' + Q)-p- Q .-p = i? 4 , 

-^5 = Pm(Q)p,p' = (28) 

[(1/2)P TO - (l/2)P d ](P' - P)p,p+q = R 5 , 

L 6 = F m {Q) Pp , = (29) 

[-(3/2)P m + (l/2)P d ](P' - P)p : p+q = R 6 . 

In Fig. [5] we plot the violation of crossing symmetry 
versus the number of iterations. It can be seen clearly 
that the crossing symmetry cannot be perfectly restored, 
even for the case of U = 2t, the measures of violation 
of crossing show oscillatory decreasing behavior, and the 
rate of decrease is quite slow even though the leading 
eigenvalues seem to be well converged. Obviously, for 
the cases of U = At and 6t the crossing symmetry is 
severely violated and at the verge of the divergence there 
are sharp increases in the crossing symmetry violation. 
This may suggest that if the crossing symmetry can be 
restored, the divergence may be avoided beyond weak 
coupling. 

In Fig. IH1 we plot the crossing symmetry violation as 
a function of iteration steps with the continuous load- 
ing method and the same parameters as in Fig. 01 It 
is important to note that the homotopy function g(y) 
does not respect the crossing symmetry except at v = 1. 
Therefore, although the solution is converged, as long as 
v ^ 1, the crossing symmetry is violated. The data for 
U = 2i, At, and 6t are shown in the upper, the middle, 
and the lower panels respectively. The data for U = 2t 
show a peak at the beginning of the iteration procedure 
when v is increased and gradually converges to a finite 
value. For v close to 1, the data show a similar oscillatory 
behavior as that from the simple iteration method. Sim- 
ilar behaviors are also observed for U = At and U = Qt, 
however, the iterations fail to converge for v close to 1; 
the crossing symmetry is strongly violated. Similar to 
that for the simple iteration method, at the verge of the 
divergence, there are sharp increases of the violation of 
crossing symmetry. 



V. SYMMETRY RESTORATION 

A. Crossing Symmetry 

Although it does not seem to be easy to analyze all the 
causes of the instability in the iteration, based on the dis- 
cussion in the above section, our conjecture is that one 
of the possible reasons for the instability is that certain 
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FIG. 3: (Color online) The leading eigenvalues of various 
channels (density (d), magnetic (m), spin singlet (s), and spin 
triplet (t)) as a function of the number of iteration steps calcu- 
lated with the simple iteration (SI) method. The calculations 
are for the half-filled Hubbard model on a 2 x 2 cluster at tem- 
perature T = OAt. The initial condition for the self-energy 
is set at + i320i, and that for the irreducible vertex is set 
at the bare Hubbard coupling. The damping factor a is set 
at 0.1. The solution is well converged for U = 2t. However, 
divergence occurs for U = At at the 55th iteration step. For 
U = 6t, divergence occurs at the 46th iteration step. 



symmetries are violated in the course of the iteration pro- 
cess. A possible strategy to improve the iteration scheme 
is to impose those symmetries explicitly into the iteration 
process, so that at each step of the iteration these sym- 
metries are not violated. The full vertices F obtained 
by the solution of the Bcthc-Salpcter equation cannot 
guarantee the crossing symmetry, unless the exact solu- 
tion is attained. Therefore, so as to preserve the crossing 
symmetry, the simplest method is to use the full vertices 
obtained by solving the Bethe-Salpeter equation (that is 
the full vertices obtained from the step 3 of the algorithm 
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FIG. 4: (Color online) The leading eigenvalues obtained by 
the continuous loading (CL) method. Symbolically we write 
the parquet equations as f P ar?«et(S, T) = and use them to 
define the auxiliary function as g = vf parquet + (1 — v)fo, 
where fo is a function with a trivial solution. We choose fo as 
a vector containing all the elements of Y — To and E — Eo . The 
iteration method is used to solve the function g(v), instead 
of the f parquet- The solution of the f 'parquet is recovered when 
v = 1. A series of v values are needed, which we denote as Vi. 
The function g(vt) is solved by the simple iteration method 
with the initial conditions given by the converged solution of 
the function g(i/{_i). For U = 2t, we can push the value of 
v to 1 to obtain the converged solution for g(y = 1), and 
the solution of the parquet equations is recovered. However, 
the iteration procedure diverges for the cases of U — 4t and 
U = 6i; they diverge at v = 0.9999 and 0.999 respectively. 
These examples show that for the cases where simple iter- 
ation method diverges, the continuous loading method may 
not eliminate the divergence, even though the value of v is 
pushed fairy close to 1. 



FIG. 5: (Color online) Crossing symmetry violation, Ei, ver- 
sus the number of iterations for the simple iteration (SI) 
method with the same parameters as Fig. [3] The six mea- 
sures of crossing symmetry violation are defined as Ei = 
\Li — Ri\/\Li + Ri\, where i = 1, 2, 6; Li and Ri axe defined 
respectively as the left hand side and the right hand side of 
the Eqs. [24] -[29] The data for U — 2t shows an oscillatory 
but decreasing trend. This is expected for the case where the 
iteration provides a well converged solution. One should note 
that although the leading eigenvalues seem to be converged, 
the crossing symmetry is not perfectly constructed from the 
iteration. For U = 4t and U = 6t, the iteration fails to provide 
converged solutions, and the crossing symmetry is strongly vi- 
olated. In particular, at the verge of the divergence, there is 
a sharp increase of the violation of crossing symmetry. 



presented in the section III) , and feed them back into the 
parquet equation to reconstruct the crossing symmetric 
full vertices. Fig. [7] illustrates the flow diagram for solv- 
ing the parquet equations with explicit restoration of the 
crossing symmetry in the full vertices. Here is the algo- 
rithm which explicitly preserves the crossing symmetry: 
1. Set the initial conditions for the irreducible and full 
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FIG. 6: (Color online) Crossing symmetry violation, Ei, ver- 
sus the number of iterations for the continuous loading (CL) 
method with the same parameters and the same definition of 
Ei as in Fig. The homotopy function g(u) does not respect 
the crossing symmetry except at v = 1. Therefore, although 
the solution may be converged for some values of v, as long 
as v 1, the violation of crossing symmetry is non-zero. For 
U — 2t the crossing symmetry violations peak near the begin- 
ning of the iteration procedure where v is small and gradually 
converge to a finite value when v = 1. Similar behaviors are 
also observed for U = 4i and U = 6i; however, since the iter- 
ation fails to converge for v close to 1, the crossing symmetry 
is strongly violated. Similar to that observed in the simple 
iteration method, at the verge of the divergence, there is a 
sharp increase of the violation of crossing symmetry. 



vertices, and the self-energy. 

2. Update the Green's functions and calculate the 
bare susceptibility, xo ■ Solve the parquet and the Bcthc- 
Salpetcr equations for the irreducible vertices, T. Simple 
iteration is used until the convergence criteria are met 
for the irreducible vertices. 

Since we will restore the crossing symmetry of the full 
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FIG. 7: (Color online) Flow diagram of the algorithm for solv- 
ing the parquet equations with crossing symmetry restoration. 
See the text for the description of each step. The main dif- 
ference compared to the previous algorithm is in the step 3b 
where the crossing symmetry is restored explicitly in the full 
vertex, F. Because of this explicit restoration of the crossing 
symmetry, in practice, the step 2 is only iterated for one time 
as the self-consistency is not required to generate the crossing 
symmetry. 



vertices in the step 3b, we find that it is not necessary 
to attain the self-consistency for step 2. In practice, we 
iterate the parquet equations for one time only. The next 
step is to use the irreducible vertices obtained from the 
parquet equations to construct the full vertices. 

3a. Solve the Bethe-Salpeter equation to obtain the 
full vertices, F, using the irreducible vertices from the 
previous step. This is executed exactly by calling the 
LAPACK routines for the inverse of the matrices^ 

3b. Use the new irreducible vertices obtained in step 

2 and the full vertices obtained in step 3a to form the 
vertex ladders. Construct the full vertices from Eqs. [8] 
and IH1 using the vertex ladders. Following these steps, 
the crossing symmetry is restored in the full vertex F. 

With the full vertices obtained, we can update the self- 
energy. 

4. Solve the Dyson-Schwinger equation to obtain the 
self-energy from the full vertices. Simple iteration is used 
until the convergence criteria are met for the self-energy. 

5. Solve the Dyson equation for the fully dressed Green 
function from the self-energy. 

This completes the iteration loop, and the procedure is 
repeated from step 2 until the criteria of convergence are 
met for both the self-energy and the irreducible vertices. 

The main difference between the current algorithm and 
the previous algorithm we present in Section III is in step 

3 where the full vertices are constructed. In the previ- 
ous algorithm the Bethc-Salpcter equation is solved many 
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times to attain convergence. When the absolute conver- 
gence is attained, the crossing symmetry will be satis- 
fied. In the current algorithm, we just explicitly solve 
the Bethe-Salpeter equation in step 3a to refresh the full 
vertices. Once we obtain the full vertices, in step 3b, we 
construct the new vertex ladders and the new full ver- 
tices from the vertex ladders using the Bcthc-Salpctcr 
equation. By doing so, the crossing symmetry of the full 
vertices will be satisfied; see Eqs. [T6J [T7l [18J an d [HI In 
Fig. [5] we show the leading eigenvalues using the same set 
of parameters used in Fig. [3] While the simple iteration 
scheme without crossing symmetry fails to converge for 
the case of U = 4 and 6, it provides a converged solution 
when the crossing symmetry is explicitly restored. 



B. Time- reversal and Point Group Symmetry 

Besides imposing the crossing symmetry on the full 
vertices, some of the internal symmetries can also be 
imposed on the irreducible vertices and the self-energy 
without much computational overhead. We illustrate the 
time-reversal symmetry in the self-energy 



E(k,iw) = E*(k,-io;) 



(30) 



and the vertices (spatial reflection symmetry and parity 
invariance are assumed), 



F{Q)p,p> =F(Q)p\p- 



(31) 



As these symmetry operations do not mix vertices 
across different values of Q, and providing that the data 
is distributed with one or more Q at each node, the time 
reversal symmetry of the vertices can be imposed without 
invoking communications across different nodes. There- 
fore, enforcing time-reversal symmetry will only cause a 
very minor computational overhead. 

Other symmetries, such as the point group symmetry 
for the square lattice can be rather cumbersome. An 
expensive scheme involving heavy internodc communi- 
cation would be required to impose the complete set of 
point group symmetries. However, we may impose an im- 
portant subset of the operations R a for which R a (Q) = Q 
without expensive communications. In these cases, the 
vertices may be symmetrized by performing the sum 

F{Q)p P > = — ^2 F(Q) R , P) R , P >s, 

N Rai Q } =Q R ^ =Q 

(32) 

where -^ij Q (Q)=Q is the number of elements in this subset 
of operations. For general Q in the cluster Brillouin zone 
there would be no a such that R a (Q) = Q apart from 
the identity. However, for the points of high symmetry, 
Ra(Q) = Q for all a. Generally, the instabilities first 
occur here, so imposing the point group symmetries at 
these Q values should have the greatest impact. 

In Fig. [HI we show the leading eigenvalues of various 
channels when both crossing and time-reversal symme- 
tries are imposed for the same set of parameters being 
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FIG. 8: (Color online) The leading eigenvalues of various 
channels (density (d), magnetic (m), spin singlet (s), and spin 
triplet (t)) versus the number of iterations with the simple in- 
teraction (SI) method with crossing symmetry (CS). The pa- 
rameters used are the same as the data shown in Fig. [3] The 
only difference is that the crossing symmetry in the full ver- 
tex, F, is explicitly restored at each step of the iteration. This 
is easily achieved by constructing the full vertex directly from 
Eqs. [8] and [9] The simple iteration scheme without crossing 
symmetry fails for the case of U = 4 and 6. With the cross- 
ing symmetry explicitly restored, converged solutions are ob- 
tained. 



used from the data in Fig. [3] and [8j Spatial reflection 
symmetry, parity invariance and spin rotation invariance 
are assumed as appropriate for the two-dimensional Hub- 
bard model at non-zero temperature. We can see that 
there is only very marginal improvement for the conver- 
gence compared to the results without explicitly restor- 
ing the time- reversal symmetry (see Fig. [8j. We also use 
the scheme described above to partially impose the point 
group symmetries. However, these symmetries resulted 
in no additional improvements and therefore no results 
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FIG. 9: (Color online) The leading eigenvalues of various 
channels: density (d), magnetic (m), spin singlet (s), and 
spin triplet (t), as a function of the iteration. The param- 
eters used are the same as the data shown in Fig. [3] Two 
symmetries are explicitly restored at each step of the itera- 
tion: the crossing symmetry (CS) in the full vertex, F, and 
the time-reversal symmetry (TRS) for the self-energy, E, and 
both the irreducible vertex, F, and the full vertex, F, (that 
is F(Q) pp i = F(Q) p i p and similarly for F). Notice there 
is no substantial gain in the convergent rate compared to the 
case with only the crossing symmetry being restored. 



are shown. 



VI. LEADING EIGENVALUE OF THE 
ANTIFERROMAGNETIC CHANNEL 

With the improved scheme proposed in this paper, we 
are able to explore a wider range of temperature and 
coupling strength for the half-filled Hubbard model. In 
Fig.[TU]wc show the leading eigenvalue for the most singu- 
lar channel, the antiferromagnctic channel, A m , as a func- 




FIG. 10: (Color online) The leading eigenvalues of the an- 
tiferromagnetic magnetic channel, A m , as a function of the 
coupling, U, calculated with the simple iteration method. 
Different curves correspond to different temperatures. The 
data points enclosed in a black square correspond to the cases 
where the simple iteration without any symmetry restoration 
can provide a converged solution. 

tion of U for a range of temperatures as low as T = 0.15t. 
The data points enclosed in a black square correspond to 
the cases where the simple iteration without any sym- 
metry restoration provides a converged solution. For 
all temperatures, X m increase sharply at weak coupling 
(U ~ 2t), and they tend to saturate at strong coupling 
(U ~ 6i). They are most sensitive to temperature at 
the intermediate coupling (2t < U < At). We emphasize 
that convergency is not possible without the improved 
scheme, unless a large number of iterations are used to 
attain the crossing symmetry. 

VII. SUMMARY AND DISCUSSION 

We present improvements of numerical implementa- 
tions for solving the parquet equations for the Hubbard 
model. The main strategy is to enforce the symmetries in 
the iteration process. The most prominent advantage of 
the parquet formulation, compared to most of the other 
approaches, is that the crossing symmetry is exactly ful- 
filled. However, in general, it is true only if the exact 
solution is found. With the simple iteration method, 
the crossing symmetry is strongly violated prior an in- 
stability, suggesting that the instability is due to these 
symmetry violations. 

The continuous loading or homotopy method does not 
improve convergence significantly beyond the simple iter- 
ation method. We note that the solutions of the contin- 
uous loading function do not preserve crossing symme- 
try. This may partly explain why the continuous loading 
method does not provide significant improvement over 
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simple iteration. 

We present a simple method to enforce the crossing 
symmetry at each step of the iteration which does not 
substantially increase the computational cost. The addi- 
tion of these symmetry constraints can greatly improve 
the stability of the calculation, so that a wider range of 
parameters can be explored by the parquet formulation. 
Along this line of thought, one can expect that the sta- 
bility may be further improved if other symmetries are 
also imposed, the obvious ones being time-reversal and 
point group. However, these additional symmetries did 
not improve the stability significantly beyond that ob- 
tained with crossing symmetry alone. 

The parquet formulation still remains as one of the 
best approaches for calculating the two-particle vertex 
functions in a self-consistent manner. At present, solv- 
ing the parquet equations for a large lattice size is still a 
very challenging task; however, with the continuous ad- 
vances of computational facilities, it should become more 
feasible in the foreseeable future. A promising direction, 
which allows immediate application of the parquet for- 
mulation, is to incorporate it as part of the multi-scale 
many-body approach . 47 ' 48 
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Appendix A: Parallel Implementation with Latency 
Hiding 

This appendix describes a highly effective implemen- 
tation of the symmetry-enforcing variant of the parquet 
formulation described earlier in the paper. The commu- 
nication bottleneck in this implementation is the expen- 
sive tensor rotations required to rotate the vertex ladders 
(Eqs. QJJ and [TTj) between the forms used in the Bethe- 
Salpeter equation, Eqs.[S]and|ni to those used in the par- 
quet equations, Eqs. H3]-[T5j If we distribute the equa- 
tions between the processes executing on the compute 
nodes of a parallel machine using the transfer momenta 
Q, the tensor rotations are done with an expensive all-to- 
all communication among those processes, in which every 
node needs to communicate with all the other nodes. The 



MPI implementation of the all-to-all communication 4 ^ is 
a collective operation that is blocking, i.e., each process 
has to wait until the message has been sent out. The 
key aspect of our implementation is the decomposition of 
the required communication so that non-blocking com- 
munication primitives can be effectively utilized. The 
non-blocking communication enables latency hiding by 
overlapping computations and communications. 

Four different forms of tensor rotations are required: 



rotation 1 : $(Q)p,p' 
rotation 2 : $(Q)p,p> 
rotation 3 : <l>((2)pp< 
rotation 4 : $(Q)p ; p/ 



*(P' - P)p,p+q (Al) 
$(P' - P)-P,, P+Q (A2) 
$(P + P' + 0)_ P ,,_p(A3) 
$(P + P' + Q)_p_ q ,_p. 

(A4) 



Note that the indices in subscripts and those in paren- 
thesis are equivalent, with the latter only distinguished 
by also labeling the nodes where data is distributed. The 
size of the tensors is N t x N t x N t , where N t is the num- 
ber of momentum points times the number of discrete 
Matsubara frequencies, i.e., Nt = Nk X N u . All indices 
are in modulo arithmetic at each of the D + 1 dimen- 
sions, where D = 2 (the cluster dimension), and the "1" 
is for the Matsubara frequency. Because it takes many 
iterations (up to a few hundred for low temperatures and 
strong coupling) to obtain converged solutions, the total 
number of tensor rotations is significant and account for 
a large fraction of the computational time. 

We use the hybrid MPI/OpenMP model for the com- 
putations. The rank three tensors are decomposed and 
evenly distributed into N virtual nodes. Each virtual 
node consists of a few cores. The size of a virtual node 
(i.e., the number of cores) is less than or equal to the size 
of a physical node. Specifically, we slice the rank three 
tensors to a set of two-dimensional arrays based on the 
index in parenthesis, e.g., Q and P — P' for the left and 
right sides of Eq. IA11 respectively. Then, each two di- 
mensional matrix is assigned to a virtual node. Since we 
have N t layers of two dimensional slices, the total number 
of virtual nodes also becomes N = N t . In this scenario, 
every rotation requires data communications among all 
nodes. The following describes the data access patterns 
for our implementation of the tensor rotations. 
Step 1: This step involves no MPI communication and 
is done before any data is sent between nodes. The tensor 
elements are locally rearranged in order to collect specific 
elements to be grouped and sent to designated destina- 
tion nodes. The index in parenthesis of the tensors on 
the right of Eqs. I All IA4I represents the rank of a sending 
node in which a sliced two dimensional matrix resides. 
For rotations 1 and 2, rank of sending node S is 



S = P' -P. 
For rotations 3 and 4, S is 

S = P + P' + Q. 



(A5) 



(A6) 
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Using Eqs. IA5l and IA61 and applying these to the corre- 
sponding rotations, the two dimensional matrix elements 
are grouped based on the rank of destination node Q 
from a given sending node S. 



rotation 1 : Apq 
rotation 2 : Ap^q 
rotation 3 : Ap_q 
rotation 4 : Ap t q 



$(S)p, p+q (A7) 

$(S)_ (P+S) ,p + Q (A8) 

<S>(S) P+Q -s,-p (A9) 

HS)-(p+q),-p (A10) 



Note that, here, S is the node index (the index of the 
sender) and P, Q G {0, . . . , Nt — 1}, so the P and Q are 
the row and column indices of the matrix. We assume 
column-major order data access in MPI data communi- 
cations which distribute columns of matrix A to nodes 
of rank Q in the next step. 

Step 2: The columns of the two dimensional ma- 
trix A are distributed among all nodes. At the send- 
ing nodes, each column of A is sent to a destina- 
tion node labeled by Q. The standard approach is to 
use MPLALLTOALL. However, as we show later, this 
task can be done using different combinations of point- 
to-point communications^ In particular, non-blocking 
communication protocols can be applied to overlap com- 
munications and local computations. Overall, this pro- 
cedure is applied to all the tensor rotations and can be 
written as 

Bp,s a t rank Q node: 4— Apq at rank S node. (All) 



As shown in Eq. IA111 the rank of destination nodes is 
determined by the column index Q of A in sending nodes. 
The rank of sending nodes becomes column index S of 
B in the receiving nodes. The rank of sending nodes S 
must be provided to receiving nodes in order to assign 
the correct column index to the received messages. 
Step 3: Once messages have arrived at the destination 
nodes, the columns of the two dimensional matrix B arc 
rearranged to complete the tensor rotations. The column 
index of the rotated received matrix is related to the rank 
of the sending and receiving nodes by Eqs. IA5I and lA6l 

Then, the rotations are finalized by using the following 
relations 



rotation 1, 2 
rotation 3, 4 



HQ) 
HQ) 



P,S+P 
P,S-{P+Q) 



D 



p,s 
B 



p,s, 



(A12) 
(A13) 



where Q is the index of a given receiving node and P, S £ 
{0, ...,Nt — 1}. This step is a local process, i.e., no 
intcrnode communication is necessary. 



Improving the Performance of Tensor Rotations 

While steps 1 and 3 are strictly local processes, step 
2 is the only stage involving nonlocal MPI communi- 
cations. The nature of the collective communications 
among all nodes in step 2 makes it suited to the use of 



MPLALLTOALL. In such a case, step 2 can start only 
after the completion of step 1. Because MPLALLTOALL 
is a blocking communication, step 3 must wait to start 
until step 2 is finished. Therefore, the total elapsed time 
to complete a tensor rotation is the sum of elapsed times 
of the three steps. When the problem size is large, the 
communication efficiency of MPLALLTOALL is reduced 
significantly due to the increased network complexity as- 
sociated with the bandwidth and latency among all par- 
ticipating nodes. Our approach to handle these rotations 
more efficiently is to implement a latency hiding strategy 
by overlapping message communications (step 2) and lo- 
cal computations (steps 1 and 3). 

To enable this, we have developed our own version of a 
routine that performs communications from all nodes to 
all nodes. At a basic level, the functionality of this rou- 
tine is identical to that of the generic MPLALLTOALL 
routine. However, our routine allows further data ma- 
nipulations such that local computations are embedded 
between communications in the following way: On the 
sending node, the first column of A is computed from 
the equations of step 1. Then, MPI_ISEND sends out the 
first column of A. While this column is being sent out, the 
next column of A is prepared with step 1. This procedure 
is repeated until all Nt columns of A, the group of the 
selected elements from are sent out. This process over- 
laps steps 1 and 2. Latency hiding is also implemented in 
receiving nodes. We note that the sending nodes are also 
receiving nodes. They only differ by whether they are 
operating in the sending or the receiving mode. On the 
receiving nodes, MPLIRECV is set to receive messages 
from arbitrary nodes by using MPLANY.SO URGE as a 
tag identifying the source of the message. For efficiency 
reasons, MPIJRECV is posted before MPLISEND of 
the sending process. Then, MPLTEST calls are used 
to check the completion of the arrival of the message. 
Once message arrival is confirmed, the rank of the node 
that sent this message can be identified by inquiring using 
MPLSTATUS. This provides S to assign to a correspond- 
ing column and to be used in step 3. Since the message 
arrival is column-by-column, the processing of each col- 
umn of B continues to step 3 while the next column is 
traveling through the network. This process is repeated 
until all columns are completed. This procedure com- 
pletely overlaps steps 2 and 3. 

Depending on the size of problem, it is desirable to 
define a virtual node containing several cores (assuming 
multicorc hardware architecture) based on the memory 
availability per node. Among the cores, MPI commu- 
nications are assigned to one core. The other cores are 
utilized by implementing OpenMP— that parallelizes the 
local computational tasks in a node to all cores within the 
node. Thus, OpenMP thread depth is set to match with 
the total number of cores per virtual node. Specifically, 
we applied the DO directive of OpenMP for iterations of 
index P in the column selection processes of steps 1 and 
3. 
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Experimental Results 

We test the efficiency of this latency hiding scheme 
using a non-blocking protocol against the standard 
MPI_ALLTOALL. All the experimental comparisons are 
conducted on the Cray XT5 (Jaguar) at the National 
Center for Computational Sciences (NCCS) at the Oak 
Ridge National Laboratory. Jaguar consists of 12 cores 
per node, with six cores per NUMA (Non-Uniform Mem- 
ory Access) node, and two NUMAs per node. First, we 
discuss hardware-driven constraints in implementing la- 
tency hiding. The non-blocking MPIJSEND docs not 
check for the arrivals of messages. With larger tensor 
size, the node usage and the size of individual columns 
becomes large. The MPLISEND from all participating 
nodes tries to dump a large column in each iteration. The 
next iteration starts regardless of message arrivals in the 
receiving nodes. As a consequence, a large amount of 
data rushes onto the network faster than the data can be 
absorbed by the receiving nodes. Eventually, this causes 
memory overflow to the system buffer assigned to the 
message processing unit. To avoid this we have allocated 
more memory space to the system buffer. 



cores per a virtual node. Smaller core usage per node 
alleviates the buffer restriction. For example, hardware 
setup with a NUMA node per virtual node consumes less 
buffer memory due to the reduced total number of physi- 
cal nodes participating in internode communication. We 
did not observe buffer memory overflow with the generic 
blocking MPLALLTOALL routine. 
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FIG. 11: (Color online) Required minimum buffer size to ex- 
ecute our all-to-all routine; each node has 12 cores. 

For simplicity, we assign one virtual node to a phys- 
ical node. On the Jaguar Cray XT5, this means one 
virtual node containing 12 cores. To utilize all cores in 
a node, the value of OpcnMP thread depth is set to 12. 
We gradually increase the problem size N t until jobs end 
with error indicating buffer overflow. Then, we set a 
higher buffer size by controlling environmental variable 
MPICH_UNEX_BUFFER_SIZE. For every incidence of 
error, we add 60 MB buffer size. The default value of 
MPICH_UNEX_BUFFER_SIZE is 60 MB on JAGUAR 
XT5 (the total number of cores is less than 50,000). The 
results are shown in the Fig. [TT] Up to Nt = 1024, the 
60 MB default buffer size is enough to handle the data 
traffic. Increasing Nt further forces us to use a larger 
buffer size. Overall, the amount of added buffer size in- 
creases for larger problem sizes. We note that the results 
presented in Fig. [TTJ are with the maximum number of 



FIG. 12: (Color online) Time spent in data communication 
as a function of the number of computer nodes (12 processors 
per node). For large data sets each process sends messages 
to all the others, and the communication time scales linearly 
with the number of processes. Latency hiding techniques that 
overlap the interprocessor communication with local compu- 
tations yields a factor of two speedup when compared with 
the standard MPLALLTOALL implementations as the num- 
ber of processors increases beyond 30,000. 

The performance of the latency hiding approach is 
evaluated in terms of wallclock time spent on a single 
tensor rotation and compared with the case of the stan- 
dard MPLALLTOALL applied for step 2. For this, the 
elapsed time to complete the tensor rotation is averaged 
over nine independent runs. Each run contains 40 repe- 
titions of identical tensor rotations. At the end of each 
run, the elapsed time is also averaged for the 40 rota- 
tions. For all runs, we choose rotation 1 and the min- 
imum buffer sizes shown in the Fig. [TT] are assumed. 
The comparison results are shown in Fig. [T^] Except 
for Nt = 768 and 2048, latency hiding outperforms the 
case without latency hiding in significant amount. The 
performance differences are even higher for Nt > 2304. 
For the MPL_ALLTOALL case, there is a sudden speed- 
up at Nt = 2048. We are exploring this behavior further. 
We believe that it is caused by changes in the data traffic 
controlled by the hardware. 

Overall, latency hiding provides a higher speed-up for 
larger tensor sizes and core count. From the general 
trends, it can be expected that two-fold or more efficiency 
improvement for Nt greater than 2816 can be obtained 
by implementing latency hiding with our non-blocking 
adaptation of the all-to-all routine for tensor rotations. 
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